{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import checklist\n",
    "from checklist.editor import Editor\n",
    "from checklist.perturb import Perturb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating examples from existing datasets via perturbations "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "editor = Editor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by creating a fictitious dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = ['John is a very smart person, he lives in Ireland.',\n",
    "        'Mark Stewart was born and raised in Chicago',\n",
    "        'Luke Smith has 3 sisters.',\n",
    "        'Mary is not a nurse.',\n",
    "        'Julianne is an engineer.',\n",
    "        'My brother Andrew used to be a lawyer.']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Writing your own perturbations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's say we want to write a perturbation function to replace some professions with other professions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "def change_professions(x, *args, **kwargs):\n",
    "    # Returns empty or a list of strings with profesions changed\n",
    "    professions = ['doctor', 'nurse', 'engineer', 'lawyer']\n",
    "    ret = []\n",
    "    for p in professions:\n",
    "        if re.search(r'\\b%s\\b' % p, x):\n",
    "            ret.extend([re.sub(r'\\b%s\\b' % p, p2, x) for p2 in professions if p != p2])\n",
    "    return ret\n",
    "            "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Mary is not a doctor.', 'Mary is not a engineer.', 'Mary is not a lawyer.']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "change_professions(data[3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could use this function on every example in `data`, and keep only cases where it applies.\n",
    "There is an auxiliary function that does this (and more) for us, called `Perturb.perturb`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Mary is not a nurse.',\n",
       "  'Mary is not a doctor.',\n",
       "  'Mary is not a engineer.',\n",
       "  'Mary is not a lawyer.'],\n",
       " ['Julianne is an engineer.',\n",
       "  'Julianne is an doctor.',\n",
       "  'Julianne is an nurse.',\n",
       "  'Julianne is an lawyer.'],\n",
       " ['My brother Andrew used to be a lawyer.',\n",
       "  'My brother Andrew used to be a doctor.',\n",
       "  'My brother Andrew used to be a nurse.',\n",
       "  'My brother Andrew used to be a engineer.']]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(data, change_professions, keep_original=True)\n",
    "ret.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how `Perturb.perturb` automatically ignored examples in our dataset where the perturbation didn't return anything, e.g. 'John is a very smart person'.  \n",
    "We set `keep_original=True`, and therefore the original data point is kept as the first in every example list. This is typically what we want to do in perturbation tests. This is what we would get if we had set it to `False`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Mary is not a doctor.', 'Mary is not a engineer.', 'Mary is not a lawyer.'],\n",
       " ['Julianne is an doctor.', 'Julianne is an nurse.', 'Julianne is an lawyer.'],\n",
       " ['My brother Andrew used to be a doctor.',\n",
       "  'My brother Andrew used to be a nurse.',\n",
       "  'My brother Andrew used to be a engineer.']]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(data, change_professions, keep_original=False)\n",
    "ret.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also specify a number of samples if our dataset is too large:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Mary is not a doctor.', 'Mary is not a engineer.', 'Mary is not a lawyer.']]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(data, change_professions, keep_original=False, nsamples=1)\n",
    "ret.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we may want our perturbation function to return some metadata. In our case, maybe we want to remember which profession was swapped into which profession. To do so, let's rewrite `change_professions` so that it returns an additional list with metadata:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def change_professions(x, meta=False, *args, **kwargs):\n",
    "    # Returns empty or a list of strings with profesions changed\n",
    "    professions = ['doctor', 'nurse', 'engineer', 'lawyer']\n",
    "    ret = []\n",
    "    ret_meta = []\n",
    "    for p in professions:\n",
    "        if re.search(r'\\b%s\\b' % p, x):\n",
    "            ret.extend([re.sub(r'\\b%s\\b' % p, p2, x) for p2 in professions if p != p2])\n",
    "            ret_meta.extend([(p, p2) for p2 in professions if p != p2])\n",
    "    if meta:\n",
    "        return ret, ret_meta\n",
    "    else:\n",
    "        return ret\n",
    "            "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(['Mary is not a doctor.', 'Mary is not a engineer.', 'Mary is not a lawyer.'],\n",
       " [('nurse', 'doctor'), ('nurse', 'engineer'), ('nurse', 'lawyer')])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "change_professions(data[3], meta=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now call `Perturb.perturb` with `meta=True`. Whatever keyword arguments we use in `Perturb.perturb` get passed along to the perturbation function. In addition, if we set `meta=True`, `ret.meta` will have the metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data\n",
      "[['Mary is not a nurse.', 'Mary is not a doctor.', 'Mary is not a engineer.', 'Mary is not a lawyer.']]\n",
      "Metadata\n",
      "[[None, ('nurse', 'doctor'), ('nurse', 'engineer'), ('nurse', 'lawyer')]]\n"
     ]
    }
   ],
   "source": [
    "ret = Perturb.perturb(data, change_professions, keep_original=True, nsamples=1, meta=True)\n",
    "print('Data')\n",
    "print(ret.data)\n",
    "print('Metadata')\n",
    "print(ret.meta)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### General-purpose perturbations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We provide some general-purpose perturbation functions. Some assume you have preprocessed the data with spacy, so let's do that:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load('en_core_web_sm')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdata = list(nlp.pipe(data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Punctuation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Perturb.strip_punctuation` removes punctuation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(John is a very smart person, he lives in Ireland.,\n",
       " 'John is a very smart person, he lives in Ireland')"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdata[0], Perturb.strip_punctuation(pdata[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Perturb.punctuation` adds and / or removes punctuation (notice that we add it when it's not there and remove it when it is)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['John is a very smart person, he lives in Ireland.',\n",
       "  'John is a very smart person, he lives in Ireland'],\n",
       " ['Mark Stewart was born and raised in Chicago',\n",
       "  'Mark Stewart was born and raised in Chicago.'],\n",
       " ['Luke Smith has 3 sisters.', 'Luke Smith has 3 sisters'],\n",
       " ['Mary is not a nurse.', 'Mary is not a nurse']]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(pdata, Perturb.punctuation)\n",
    "ret.data[:4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Typos"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('John is a very smart person, he lives in Ireland.',\n",
       " 'John is a very smar tperson, he lives in Ireland.')"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[0], Perturb.add_typos(data[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Luke Smith has 3 sisters.', 'Luke Smith has 3 sistres.']]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(data, Perturb.add_typos, nsamples=1)\n",
    "ret.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Contractions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Perturb.expand_contractions` and `Perturb.contract` act on a single string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('Mary is not a nurse.', \"Mary isn't a nurse.\")"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[3], Perturb.contract(data[3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'What is going on?'"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Perturb.expand_contractions('What\\'s going on?')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Perturb.contractions` contracts AND expands contractions if possible:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['What is going on? I am not happy', \"What's going on? I'm not happy\"]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Perturb.contractions('What\\'s going on? I am not happy')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Mary is not a nurse.', \"Mary isn't a nurse.\"]]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(data, Perturb.contractions)\n",
    "ret.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Changing named entities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following functions all assume you have parsed the input with spacy.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Perturb.change_names allows you to replace person names automatically.  \n",
    "You can specify if you only want first names, first and last names, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Luke Smith has 3 sisters.',\n",
       "  'Michael White has 3 sisters.',\n",
       "  'Christopher Lewis has 3 sisters.',\n",
       "  'Matthew Phillips has 3 sisters.',\n",
       "  'David Hall has 3 sisters.',\n",
       "  'James White has 3 sisters.',\n",
       "  'John Rogers has 3 sisters.',\n",
       "  'Joshua Collins has 3 sisters.',\n",
       "  'Daniel Mitchell has 3 sisters.',\n",
       "  'Joseph Walker has 3 sisters.',\n",
       "  'William Smith has 3 sisters.']]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(pdata[2:3], Perturb.change_names, nsamples=1)\n",
    "ret.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also specify how many replacements with `n` (default is 10):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Julianne is an engineer.',\n",
       "  'Melanie is an engineer.',\n",
       "  'Sydney is an engineer.',\n",
       "  'Morgan is an engineer.']]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(pdata, Perturb.change_names, nsamples=1, n=3)\n",
    "ret.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Mark Stewart was born and raised in Chicago',\n",
       "  'Jesse Stewart was born and raised in Chicago',\n",
       "  'Shawn Stewart was born and raised in Chicago',\n",
       "  'Michael Stewart was born and raised in Chicago']]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(pdata, Perturb.change_names, nsamples=1, first_only=True, n=3)\n",
    "ret.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Luke Smith has 3 sisters.',\n",
       "  'Luke Powell has 3 sisters.',\n",
       "  'Luke Watson has 3 sisters.',\n",
       "  'Luke Brown has 3 sisters.']]"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(pdata, Perturb.change_names, nsamples=1, last_only=True, n=3)\n",
    "ret.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also set `meta=True` if you want to save the change in the metadata:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(['Mark Walker was born and raised in Chicago',\n",
       "  'Mark Collins was born and raised in Chicago',\n",
       "  'Mark Allen was born and raised in Chicago'],\n",
       " [('Stewart', 'Walker'), ('Stewart', 'Collins'), ('Stewart', 'Allen')])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(pdata, Perturb.change_names, nsamples=1, n=3, last_only=True, meta=True)\n",
    "ret.data[0][1:], ret.meta[0][1:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, you can change locations with `Perturb.change_location`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(['Mark Stewart was born and raised in Chicago',\n",
       "  'Mark Stewart was born and raised in Madison',\n",
       "  'Mark Stewart was born and raised in Wichita',\n",
       "  'Mark Stewart was born and raised in Newark'],\n",
       " [None, ('Chicago', 'Madison'), ('Chicago', 'Wichita'), ('Chicago', 'Newark')])"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(pdata, Perturb.change_location, nsamples=1, n=3, meta=True)\n",
    "ret.data[0], ret.meta[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And numbers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(['Luke Smith has 3 sisters.',\n",
       "  'Luke Smith has 2 sisters.',\n",
       "  'Luke Smith has 2 sisters.',\n",
       "  'Luke Smith has 4 sisters.'],\n",
       " [None, ('3', '2'), ('3', '2'), ('3', '4')])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(pdata, Perturb.change_number, nsamples=1, n=3, meta=True)\n",
    "ret.data[0], ret.meta[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Negation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have some experimental functions (may or may not work) for adding or removing negations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['John is a very smart person, he lives in Ireland.',\n",
       "  \"John is a very smart person, he doesn't live in Ireland.\"],\n",
       " ['Mark Stewart was born and raised in Chicago',\n",
       "  'Mark Stewart was not born and raised in Chicago'],\n",
       " ['Luke Smith has 3 sisters.', \"Luke Smith doesn't have 3 sisters.\"],\n",
       " ['Julianne is an engineer.', 'Julianne is not an engineer.']]"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ret = Perturb.perturb(pdata, Perturb.add_negation)\n",
    "ret.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This is not good\n",
      "This is good\n",
      "\n",
      "He didn't play the guitar\n",
      "He played the guitar\n",
      "\n",
      "He doesn't play anything\n",
      "He plays anything\n",
      "\n",
      "She wasn't sad\n",
      "She was sad\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for t in ['This is not good', 'He didn\\'t play the guitar', 'He doesn\\'t play anything', 'She wasn\\'t sad']:\n",
    "    print(t)\n",
    "    print(Perturb.remove_negation(nlp(t)))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "checklist",
   "language": "python",
   "name": "checklist"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
